A random forest system combination approach for error detection in digital dictionaries
When digitizing a print bilingual dictionary, whether via optical character
recognition or manual entry, it is inevitable that errors are introduced into
the electronic version that is created. We investigate automating the process
of detecting errors in an XML representation of a digitized print dictionary
using a hybrid approach that combines rule-based, feature-based, and language
model-based methods. We investigate combining methods and show that using
random forests is a promising approach. We find that in isolation, unsupervised
methods rival the performance of supervised methods. Random forests typically
require training data so we investigate how we can apply random forests to
combine individual base methods that are themselves unsupervised without
requiring large amounts of training data. Experiments reveal empirically that a
relatively small amount of data is sufficient and can potentially be further
reduced through specific selection criteria.
Comment: 9 pages, 7 figures, 10 tables; appeared in Proceedings of the Workshop on Innovative Hybrid Approaches to the Processing of Textual Data, April 201
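The combination step the abstract describes can be illustrated with a minimal sketch. The scores, labels, and sample sizes below are synthetic stand-ins: three hypothetical base detectors (standing in for the rule-based, feature-based, and language-model-based methods) each emit an anomaly score per entry, and a random forest is trained as the combiner on only a small labeled subset.

```python
# Minimal sketch: combine three unsupervised error detectors with a
# random forest trained on a small labeled sample. All data here is
# synthetic; the real base methods are rule-, feature-, and
# language-model-based detectors over dictionary entries.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

n = 200
labels = rng.integers(0, 2, size=n)           # 1 = entry contains an error
# Each base method emits an anomaly score; erroneous entries score higher.
scores = np.column_stack([
    labels + rng.normal(0, 0.8, size=n)       # noisy unsupervised signal
    for _ in range(3)
])

# Train the combiner on a small subset only, mirroring the finding that
# a relatively small amount of labeled data suffices.
train_idx = rng.choice(n, size=30, replace=False)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(scores[train_idx], labels[train_idx])

accuracy = (clf.predict(scores) == labels).mean()
```

The combiner only ever sees the base methods' scores, so the base methods themselves stay unsupervised; supervision is confined to the 30-entry training subset.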
Multiple Alternative Sentence Compressions as a Tool for Automatic Summarization Tasks
Automatic summarization is the distillation of important information from a source into an abridged form for a particular user or task.
Many current systems summarize texts by selecting sentences with important content. The limitation of extraction at the sentence level
is that highly relevant sentences may also contain non-relevant and
redundant content.
This thesis presents a novel framework for text summarization that
addresses the limitations of sentence-level extraction. Under this
framework text summarization is performed by generating Multiple
Alternative Sentence Compressions (MASC) as candidate summary
components and using weighted features of the candidates to construct
summaries from them. Sentence compression is the rewriting of a
sentence in a shorter form. This framework provides an environment in
which hypotheses about summarization techniques can be tested.
Three approaches to sentence compression were developed under this
framework. The first approach, HMM Hedge, uses the Noisy Channel
Model to calculate the most likely compressions of a sentence. The
second approach, Trimmer, uses syntactic trimming rules that are
linguistically motivated by Headlinese, a form of compressed English
associated with newspaper headlines. The third approach, Topiary, is
a combination of fluent text with topic terms.
The MASC framework for automatic text summarization has been applied
to the tasks of headline generation and multi-document summarization,
and has been used for initial work in summarization of novel genres
and applications, including broadcast news, email threads,
cross-language, and structured queries. The framework supports
combinations of component techniques, fostering collaboration between
development teams.
Three results will be demonstrated under the MASC framework. The first is
that an extractive summarization system can produce better summaries
by automatically selecting from a pool of compressed sentence
candidates than by automatically selecting from unaltered source
sentences. The second result is that sentence selectors can construct
better summaries from pools of compressed candidates when they make
use of larger candidate feature sets. The third result is that for
the task of Headline Generation, a combination of topic terms and
compressed sentences performs better than either approach alone.
Experimental evidence supports all three results.
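The selection step described above can be sketched in miniature. Everything in this example is hypothetical: the candidate pools, the scores, and the greedy density-based selector are simplified stand-ins for the weighted-feature candidate selection in the MASC framework.

```python
# Hypothetical sketch: build a summary from pools of alternative sentence
# compressions under a character budget, choosing at most one candidate
# per source sentence, greedily by score-per-character density.

def select_summary(pools, budget):
    """pools: list of candidate lists, one per source sentence;
    each candidate is (text, score). Returns the chosen texts."""
    # Flatten, remembering which pool each candidate came from.
    flat = [(score / max(len(text), 1), pool_id, text)
            for pool_id, pool in enumerate(pools)
            for text, score in pool]
    flat.sort(reverse=True)                     # best score density first
    chosen, used_pools, length = [], set(), 0
    for _density, pool_id, text in flat:
        if pool_id in used_pools:
            continue                            # one candidate per sentence
        if length + len(text) <= budget:
            chosen.append(text)
            used_pools.add(pool_id)
            length += len(text)
    return chosen

pools = [
    [("The senate passed the bill.", 5.0), ("Senate passed bill.", 4.5)],
    [("Markets rose sharply on the news.", 3.0), ("Markets rose.", 2.5)],
]
summary = select_summary(pools, budget=40)
```

Under the 40-character budget the selector prefers the shorter compressions of both sentences, which is the intuition behind selecting from compressed candidates rather than unaltered source sentences.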
Citation Handling for Improved Summarization of Scientific Documents
In this paper we present the first steps toward improving summarization
of scientific documents through citation analysis and parsing. Prior
work (Mohammad et al., 2009) argues that citation texts (sentences that
cite other papers) play a crucial role in automatic summarization of a
topical area, but did not take into account the noise introduced by the
citations themselves. We demonstrate that it is possible to improve
summarization output through careful handling of these citations. We
base our experiments on the application of an improved trimming approach
to summarization of citation texts extracted from Question-Answering and
Dependency-Parsing documents. We demonstrate that confidence scores from
the Stanford NLP Parser (Klein and Manning, 2003) are significantly
improved, and that Trimmer (Zajic et al., 2007), a sentence-compression
tool, is able to generate higher-quality candidates. Our summarization
output is currently used as part of a larger system, Action Science
Explorer (ASE) (Gove, 2011).
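The paper does not spell out its citation-handling rules here, but the basic idea of removing parenthetical citations before parsing can be sketched as follows. The regex is a hypothetical illustration, not the paper's actual preprocessing.

```python
import re

# Hypothetical illustration: strip parenthetical citations of the form
# "(Author, 2003)", "(Author and Author, 2003)", or
# "(Author et al., 2009)" before handing a sentence to a parser.
CITATION = re.compile(
    r"\s*\([A-Z][A-Za-z]*"
    r"(?:\s+(?:and\s+[A-Z][A-Za-z]*|et al\.))?"
    r",\s*\d{4}[a-z]?\)"
)

def strip_citations(sentence):
    return CITATION.sub("", sentence)

s = "Prior work (Mohammad et al., 2009) argues that citation texts matter."
clean = strip_citations(s)
```

Removing such spans before parsing avoids the noise the citations would otherwise introduce into the parse, which is the effect the experiments above measure via parser confidence scores.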
Correcting Errors in Digital Lexicographic Resources Using a Dictionary Manipulation Language
We describe a paradigm for combining manual and automatic error correction of noisy structured lexicographic data. Modifications to the structure and underlying text of the lexicographic data are expressed in a simple, interpreted programming language. Dictionary Manipulation Language (DML) commands identify nodes by unique identifiers, and manipulations are performed using simple commands such as create, move, set text, etc. Corrected lexicons are produced by applying sequences of DML commands to the source version of the lexicon. DML commands can be written manually to repair one-off errors or generated automatically to correct recurring problems. We discuss advantages of the paradigm for the task of editing digital bilingual dictionaries.

This material is based upon work supported, in whole or in part, with funding from the United States Government. Any opinions, findings and conclusions, or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the University of Maryland, College Park and/or any agency or entity of the United States Government. Nothing in this report is intended to be and shall not be treated or construed as an endorsement or recommendation by the University of Maryland, United States Government, or the authors of the product, process, or service that is the subject of this report. No one may use any information contained or based on this report in advertisements or promotional materials related to any company product, process, or service or in support of other commercial purposes.
Detecting Structural Irregularity in Electronic Dictionaries Using Language Modeling
Dictionaries are often developed using tools that save to Extensible Markup Language (XML)-based standards. These standards often allow high-level repeating elements to represent lexical entries, and utilize descendants of these repeating elements to represent the structure within each lexical entry, in the form of an XML tree. In many cases, dictionaries are published that have errors and inconsistencies that are expensive to find manually. This paper discusses a method for dictionary writers to quickly audit structural regularity across entries in a dictionary by using statistical language modeling. The approach learns the patterns of XML nodes that could occur within an XML tree, and then calculates the probability of each XML tree in the dictionary against these patterns to look for entries that diverge from the norm.
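The language-modeling idea above can be sketched with a toy bigram model over tag sequences. The entry structures and the add-alpha smoothing below are invented for illustration; the paper's model over full XML trees is richer than this flat sequence view.

```python
# Hypothetical sketch: learn bigram probabilities over the sequence of
# element tags inside each lexical entry, then flag entries whose tag
# sequence is unusually improbable under the learned model.
from collections import Counter
import math

def train_bigrams(entries):
    bigrams, unigrams = Counter(), Counter()
    for e in entries:
        seq = ["<s>"] + e + ["</s>"]
        unigrams.update(seq[:-1])
        bigrams.update(zip(seq, seq[1:]))
    return bigrams, unigrams

def log_prob(entry, bigrams, unigrams, alpha=1.0):
    # Add-alpha smoothing over the observed tag vocabulary.
    vocab = len(unigrams) + 1
    seq = ["<s>"] + entry + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + alpha) / (unigrams[a] + alpha * vocab))
        for a, b in zip(seq, seq[1:])
    )

# 50 well-formed entries and one with headword and part-of-speech swapped.
entries = [["headword", "pos", "sense"]] * 50 + [["pos", "headword"]]
bigrams, unigrams = train_bigrams(entries)
scores = [log_prob(e, bigrams, unigrams) for e in entries]
outlier = scores.index(min(scores))             # index of the flagged entry
```

Entries that follow the dominant pattern score near the model's mode, while the structurally irregular entry receives a much lower log probability and surfaces at the bottom of the ranking for a dictionary writer to audit.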
Geometric Analysis of the Doppler Frequency for General Non-Stationary 3D Mobile-to-Mobile Channels based on Prolate Spheroidal Coordinates
Mobile-to-mobile channels often exhibit time-variant Doppler frequency shifts due to the movement of transmitter and receiver. An accurate description of the Doppler
frequency turns out to be very difficult in Cartesian coordinates,
and any subsequent algebraic analysis of the Doppler frequency
is intractable. In contrast to other approaches, we base our
investigation on a geometric description of the Doppler frequency
with the following three mathematical pillars: prolate spheroidal
coordinate system, algebraic curve theory, and differential forms.
The prolate spheroidal coordinate system is more appropriate to
algebraically investigate the problem. After the transformation
into the new coordinate system, the theory of algebraic curves
is needed to resolve the ambiguities. Finally, the differential
forms are required to derive the joint delay Doppler probability
density function. This function is normalized by the equivalent
ellipsoidal area of the scattering plane bounded by the delay
ellipsoid. The results generalize our previous model in a natural
way to a complete 3D description. Our solutions provide insight
into the geometry of the Doppler frequency and allowed us to
derive a Doppler frequency that depends on the delay and
the scattering plane. The presented theory allows describing any
time-variant, single-bounce, mobile-to-mobile scattering channel.
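As a brief recap of the first mathematical pillar (the paper's exact parametrization may differ), the standard prolate spheroidal coordinates place the two foci at transmitter and receiver, a distance 2a apart along the z-axis:

```latex
% Prolate spheroidal coordinates (u, v, \phi), foci separated by 2a:
\begin{align}
  x &= a \sinh u \, \sin v \, \cos\phi, \\
  y &= a \sinh u \, \sin v \, \sin\phi, \\
  z &= a \cosh u \, \cos v .
\end{align}
% Surfaces of constant u are confocal ellipsoids. For a single-bounce
% path the two focal distances sum to 2a\cosh u, so the propagation
% delay \tau is constant on each such ellipsoid:
\begin{equation}
  a \cosh u = \frac{c\,\tau}{2}.
\end{equation}
```

This is why the coordinate system is well suited to the problem: one coordinate directly indexes the delay ellipsoid, leaving the remaining two to parametrize the scatterer position on it.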